Beat the Optimizer
Volume Number: 7
Issue Number: 4
Column Tag: Fortran's World
By Jonathan Bell, Clinton, SC
Note: Source code files accompanying this article are located on the MacTech CD-ROM or source code disks.
[Jon earned a Ph.D. in elementary particle physics from the University of
Michigan by doing a lot of number-crunching in FORTRAN on a VAX. Now he teaches
physics and computer science at Presbyterian College in Clinton, SC, where he first
saw a Macintosh and fell in love with it at first sight. Although he now does much of his
teaching and programming in Pascal, and dabbles in assembler, he’s still fond of
FORTRAN, especially the dialects with post-FORTRAN-77 extensions.]
In heavy-duty number-crunching applications, the efficiency of the object code
is a major consideration. For this reason, minicomputer and mainframe compilers can
usually optimize the code in various ways. Most Macintosh compilers until recently
have not done a very good job with optimization, but this is starting to change. One
example of this is the Language Systems FORTRAN compiler, version 2.0, which runs
under the Macintosh Programmer’s Workshop (MPW). As Jörg Langowski has pointed
out (MacTutor v6#3), this new compiler does a creditable job of optimizing
array-index calculations in code involving two-dimensional arrays.
This does not mean that assembly language programming has been rendered
obsolete. In this article I would like to demonstrate (a) that it is still relatively easy
to “beat the optimizer” with hand-coded assembly language, and (b) that assembly
language code does not have to be especially “tricky” or obscure to be effective.
This article will also demonstrate how to use both implementations of Apple’s
floating-point arithmetic package: “software SANE”, which is available on all
Macintoshes via a package in the System file, and “hardware SANE”, which is available
on 68020 and 68030 machines (Mac II or higher) via the 68881 or 68882
floating-point coprocessor chip. (Whenever I mention the 68030, I will also
implicitly include the 68020; likewise the 68882 will also “include” the 68881.)
I will not attempt to teach assembly language programming in general, but will
assume that you have at least some reading knowledge of 680x0 assembly language. For
those who are interested in learning about assembly language, some references are
listed at the end of the article.
Matrix Multiplication
As our “test case” I will use matrix multiplication. To refresh your memory
(or perhaps initialize it), recall that a matrix is simply a two-dimensional table of
numbers, typically represented in a computer program by a two-dimensional array.
By convention, we let the first and second indices be the row and column numbers
respectively. Thus in Pascal or C, A[5,2] denotes the number in the fifth row of the
second column of matrix A. In FORTRAN we use parentheses rather than square
brackets: A(5,2).
If the number of columns in matrix A is the same as the number of rows in
matrix B, or equivalently, if a row of A has the same length as a column of B, then we
can multiply A and B to form a new matrix, C. To calculate C(5,2), for example, we
take the fifth row of A and the second column of B, “match up” the corresponding
elements, multiply them pairwise and finally add up the products:
C(5,2) = A(5,1) * B(1,2)
+ A(5,2) * B(2,2)
+ A(5,3) * B(3,2)
+ ...
+ A(5,M) * B(M,2).
Multiplying two matrices requires three nested loops. The two outer loops cycle
over the rows and columns of C, while the innermost loop cycles over the terms in the
sum for each element of C. In Language Systems FORTRAN we might write the following
subroutine to perform matrix multiplication:
SUBROUTINE MXMPY (A, B, C, L, M, N)
C------------------------------------------------
C Performs the matrix multiplication A*B = C.
C The dimensions of the matrices must be related
C as shown below.
C------------------------------------------------
IMPLICIT NONE
INTEGER I, J, K, L, M, N
EXTENDED A(L,M), B(M,N), C(L,N), SUM
DO I = 1, L
DO J = 1, N
SUM = 0.0X0
DO K = 1, M
SUM = SUM + A(I,K) * B(K,J)
END DO
C(I,J) = SUM
END DO
END DO
END
If you are familiar with FORTRAN, you will note that this code is not standard
FORTRAN 77. It uses several “extensions” of the language:
1. Each DO-loop is terminated with an END DO statement rather than a labeled
CONTINUE statement. The corresponding DO statements do not use a statement
number (label) to specify the end of the loop.
2. The IMPLICIT NONE statement turns off FORTRAN’s default type-specification
rules and forces the programmer to declare all variables with their data types, as
in Pascal.
3. The EXTENDED data type is an 80-bit or 96-bit floating point number as used by
the SANE packages on the Mac; which one is used depends on whether the routine
is compiled with the -mc68882 compiler option. Alternatively, for portability
to other systems, floating-point variables can be declared as REAL (32 bits) or
DOUBLE PRECISION (64 bits), in which case the compiler’s -extended option can
be used to force them to be compiled as if they were EXTENDED.
4. The constant 0.0 should be written in exponential notation as 0.0X0 to tell the
compiler that it’s supposed to be stored in EXTENDED format. Otherwise it will
be stored as a 32-bit REAL and converted to EXTENDED whenever used in
calculations, which wastes time. Again, the compiler’s -extended option will
force REAL constants to be stored as EXTENDED.
Non-FORTRAN programmers should pay special attention to the “variable
dimensions” of the arrays A, B and C. In a FORTRAN subroutine, an array argument
(or parameter, to you Pascal and C people) may be specified with variable dimensions
which are also specified as arguments. Both the array and the dimensions must be
arguments, not local variables. This allows a FORTRAN programmer to write a
subroutine which can perform the same operation on arrays and matrices of different
sizes.
Speed Tests
I wrote a simple test program [Listing #1] which sets up a matrix, multiplies
that matrix by itself ten times, records the time using the Toolbox TickCount function,
and divides by ten to get the time for a single matrix multiplication. The program
repeats this process for 5x5, 10x10, 20x20 and 50x50 matrices. (I will present
results only for 50x50.)
I also wanted to find out what proportion of this time was spent in actual
floating-point calculations as opposed to array-indexing overhead. To do this, I
compiled the subroutine to assembly-language source code (using the compiler’s -a
option), commented out the floating-point operations, renamed it DUMMY, and
assembled it using the MPW Assembler. The test program runs DUMMY ten times,
using the same matrix sizes as with the “real” MXMPY.
The Language Systems FORTRAN compiler has four levels of optimization,
numbered from 0 (no optimization) to 3 (maximum optimization). Repeating the
timing tests for each level revealed that for this example, levels 0 and 1 give the same
results, as do levels 2 and 3. (Of course, I constructed a separate DUMMY for each
optimization level!)
The timings I obtained for 50 x 50 matrix multiplication using the standard
SANE library (“software SANE”) are shown in lines 1 and 2 of the table at the end of
the article, for each of the two levels of optimization, on a Mac SE. The optimization
does in fact reduce the overhead significantly. However, the overhead is only a small
fraction of the total execution time, because software SANE is very slow, so we gain a
speed increase of only about 12%.
The only way to speed up software SANE significantly is to switch to a faster
machine. Lines 3 and 4 of the table show results for a Mac SE/30, demonstrating a
speed increase of about 5x for both levels of optimization.
We can improve the SE/30 figures by taking advantage of the 68882
floating-point coprocessor (FPU). The Language Systems FORTRAN compiler allows
you to specify that you want code which uses the FPU directly (“hardware SANE”).
Recompiling the subroutines and test program using the -mc68030 and -mc68882
options (and creating a new dummy subroutine) gave the results shown in lines 5 and 6
of the table.
Overall, the hardware-SANE versions are 6.6x (opt=1) and 7.9x (opt=2) faster
than the software-SANE versions. Interestingly, the indexing overhead runs faster in
the hardware-SANE versions, probably because fewer machine-language instructions
are necessary to set up the actual floating-point calculations.
How Matrices Work
Before proceeding further, we need to look more closely at how matrices are
implemented in high-level languages. When we store data in a matrix, just where does
it go, and when we retrieve data from a matrix, where does the program look for it?
Normally we picture a matrix as a table of numbers, laid out in a neat
two-dimensional pattern of rows and columns. However, computer memory is not
two-dimensional. It is a one-dimensional linear sequence of addresses. A
programming language that supports matrices must somehow “map” two dimensions
into one. There are two obvious ways to do this.
One way is to take the rows of the matrix one after the other and lay them out
“end to end”, forming a single long horizontal line. If we start at one end of the line and
examine one element after another in sequence until we come to the other end, we find
that the second index (the column number) passes through its complete range of values
repeatedly, with the first index (the row number) changing by one after each cycle.
This is called row-major ordering. It is the method used by Pascal, C, and most other
languages for allocating matrices.
The other way, as you might expect, is to take the columns of the matrix one
after the other and lay them out “end to end”, forming a single long vertical line. Now,
as we step from one element to the next, the first index varies more rapidly than the
second. This is called column-major ordering, and is used by FORTRAN, alone among
the major programming languages.
Using column-major ordering, we can easily verify that whenever a FORTRAN
program needs to access an element of a matrix, it must compute the address as follows:
addr B(K,J) = addr B(1,1)    [ the start of the matrix ]
+ (J - 1) * n                [ the J-1 complete columns before column J ]
+ (K - 1) * w                [ the K-1 elements before B(K,J) within column J ]
In this formula, n is the number of bytes in one complete column of the matrix,
and w is the number of bytes in one matrix element. I’ve kept things simple by
assuming that both the row and column indices have 1 as the lower bound, as is the most
common practice; if you like, you can figure out the formula for the more general case
as an exercise.
The indices J and K usually are variables, so their values have to be fetched from
memory as the program runs. If we declare the matrix in the usual way, by specifying
the size of the matrix explicitly, then the compiler can calculate the parameters n and
w and insert them into the code as constants, possibly as immediate operands in
arithmetic operations.